This report explores a data set containing different attributes and quality scores of 1599 red wines.
The data set source was created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009
It can be found at: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt
## [1] 1599 13
Our data set contains 1599 observations, and has 13 variables.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## [1] 0
There are no NA values in our wine data set.
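A minimal sketch of the two checks above, assuming the data sit in a data frame named `wines` (the name used in the model calls later in this report). To stay self-contained it rebuilds only the first 10 rows of three columns from the `str()` output instead of reading the file, so the dimensions printed here differ from the full 1599 x 13.

```r
# First 10 rows of three columns, copied from the str() output above.
# The full data set has 1599 rows and 13 columns.
wines <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

dims     <- dim(wines)          # rows, then columns
na_count <- sum(is.na(wines))   # total number of missing cells

print(dims)
print(na_count)
```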
##        X          fixed.acidity   volatile.acidity  citric.acid
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00
##  Median : 2.200   Median :0.07900   Median :14.00
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00
##  total.sulfur.dioxide    density             pH          sulphates
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000
##     alcohol         quality
##  Min.   : 8.40   Min.   :3.000
##  1st Qu.: 9.50   1st Qu.:5.000
##  Median :10.20   Median :6.000
##  Mean   :10.42   Mean   :5.636
##  3rd Qu.:11.10   3rd Qu.:6.000
##  Max.   :14.90   Max.   :8.000
The two things that stick out the most are that there are 0 values for citric acid, and the huge range of total sulfur dioxide.
Let’s explore these discrepancies as we take a look at the graphs of each variable.
Our data is in tidy form and is complete. My overall objective is to find out which variables most affect the quality rating for each wine, so let’s begin by taking a look at how many wines we have at each quality level.
Nearly all wines are average. No wines scored less than 3 or more than 8 so none were terrible or exceptional. In addition, the majority of wines received a 5 or 6 for quality. Will this lack of differentiation allow us to find any insights?
##
## (2,3] (3,4] (4,5] (5,6] (6,7] (7,8]
## 10 53 681 638 199 18
We see the number of wines at each level of quality. We see that 1319 of the 1599 wines received an average score of 5 or 6. That’s over 82% of the wines.
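The 82% figure can be checked directly from the counts printed above. This sketch types the six counts in by hand, labelled by quality score (each interval such as `(4,5]` holds exactly one integer score), so the vector name and labels are illustrative rather than taken from the original code.

```r
# Counts per quality score, copied from the table above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

total  <- sum(quality_counts)                  # all wines
middle <- sum(quality_counts[c("5", "6")])     # wines scored 5 or 6
share  <- middle / total                       # proportion in the middle

print(total)
print(middle)
print(round(100 * share, 1))
```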
Let’s take a look at the other variable counts, beginning with alcohol, as I suspect it will have the largest effect on quality ratings.
A few wines report alcohol percentage to more than one decimal place. Let’s set a bin width to make the graph easier to read.
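As a rough illustration of what setting a bin width does, the sketch below bins the first 10 alcohol values from the `str()` output with base R's `hist()`. The bin width of 0.5 and the 8 to 15 range are assumptions chosen to cover the reported minimum and maximum; the report does not state the width actually used, and the original plots may well have been drawn with a different plotting library.

```r
# First 10 alcohol values from the str() output (not the full data set).
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

bin_width <- 0.5                          # assumed width, for illustration only
breaks    <- seq(8, 15, by = bin_width)   # covers the reported range 8.4 to 14.9

# plot = FALSE returns the bin counts without drawing; plot(h) would draw them.
h <- hist(alcohol, breaks = breaks, plot = FALSE)

bins <- data.frame(
  from  = head(h$breaks, -1),
  to    = tail(h$breaks, -1),
  count = h$counts
)
print(bins[bins$count > 0, ])
```

A wider bin merges values that differ only in their second decimal place, which is what makes the histogram easier to read.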
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The lowest alcohol percentage is 8.4 and the highest is 14.9. The distribution peaks at around 9.5%, and there are fewer wines as alcohol percentage increases. I am curious how quality relates to alcohol percentage as well as how different combinations of alcohol and the other variables relate to quality. For example, does high alcohol and high sugar score better than high alcohol and low sugar, or vice versa?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
Fixed acidity is roughly normal and peaks around 7 units, with a few outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Most of the wines have a volatile acidity between 0.2 and 1. Let’s look a little bit closer at this interval and increase our bin width slightly to better observe the peaks.
We have some peaks at 0.4, 0.5, and 0.6. Do winemakers aim for those exact levels, or do some wines simply round their measurements differently?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
This is a strange distribution, but it has very few outliers.
Increasing binwidth and taking the log, we see a nearly uniform distribution that tapers down after 0.5. The noise from the original graph can likely be attributed to binwidth.
While there is a disproportionate amount of 0 values, they don’t seem too out of place. Also, because citric acid has such a strong flavor, it is used less frequently than other types of acid. This could help explain why many wines don’t contain any of it.
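One practical point when taking the log of a variable with zeros: in R, `log10(0)` is `-Inf`, so zero-valued wines need explicit handling. The sketch below uses the first 10 citric acid values from the `str()` output to count the zeros and to compare `log10()` with `log1p()`. Using `log1p()` is only one possible workaround and is an assumption here; the report does not say how the zeros were treated.

```r
# First 10 citric acid values from the str() output (not the full data set).
citric.acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

n_zero     <- sum(citric.acid == 0)    # how many wines have no citric acid
share_zero <- mean(citric.acid == 0)   # proportion of zeros in this sample

log_vals   <- log10(citric.acid)       # zeros become -Inf
log1p_vals <- log1p(citric.acid)       # log(1 + x), finite at zero

print(n_zero)
print(share_zero)
print(sum(is.infinite(log_vals)))
```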
A wine’s acidic taste profile is determined by its total acidity. Let’s create the total acidity variable by combining fixed and volatile acidity. Citric acid is included in fixed acids, so won’t be added to total acidity.
We will use this variable later in the analysis to compare different flavor profiles in our wines.
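A minimal sketch of creating the variable, using the first 10 rows of the two acidity columns from the `str()` output. The data frame name `wines` and the column name `total.acidity` follow the names used elsewhere in this report.

```r
# First 10 rows of the two acidity columns, from the str() output.
wines <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
)

# Total acidity is fixed plus volatile acidity; citric acid is already
# counted within fixed acidity, so it is not added again.
wines$total.acidity <- wines$fixed.acidity + wines$volatile.acidity

print(head(wines, 3))
```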
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Almost all wines have sugar levels between 1 and 3, but there are some major outliers with a max of 15.5. Why do some wines have so much sugar? How does it affect the other variables especially quality? Let’s take a closer look at this and adjust bin width.
A closer look at the bulk of the wines shows a normal distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
This graph looks quite a lot like the sugar graph, normal at a certain interval but with a lot of outliers. This is demonstrated by the huge difference between Q3 and the max.
Zooming in and changing the bin width makes the normality of the graph much easier to see. Although there are a lot of outliers, even the highest level is still low at just over 0.6 g/dm^3.
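The gap between Q3 and the maximum for chlorides can be put in numbers. The sketch below applies the common 1.5 x IQR boxplot rule to the quartiles printed in the summary above. That rule is an assumption used for illustration; the report does not state how outliers were identified.

```r
# Quartiles and maximum for chlorides, copied from the summary output above.
q1      <- 0.070
q3      <- 0.090
max_val <- 0.611

iqr         <- q3 - q1            # interquartile range
upper_fence <- q3 + 1.5 * iqr     # values above this count as boxplot outliers

print(upper_fence)
print(max_val / upper_fence)      # how far the maximum lies beyond the fence
```

With a fence this close to Q3, any wine far above the typical chloride range is flagged, which is consistent with the long right tail seen in the plot.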
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
The graph for free sulfur dioxide is skewed right. Let’s see if we can glean any information from the log of this graph.
Outside of a few outliers, there is nothing unusual about this graph. Nearly all wines contain fewer than 40 units of free sulfur dioxide.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Earlier, it was interesting to see the large range of total sulfur dioxide in the wines. We see here that this discrepancy is caused by only a few extreme outliers and so is not very important to investigate.
The graph of total sulfur dioxide is very similar to that of free sulfur dioxide. This is not much of a surprise. To be sure, let’s check the log of this graph as well.
Like the free sulfur graph, there is not much of note with this graph other than a couple of blank levels. In the previous graph we did see some extreme outliers but nearly all wines contained fewer than 150 units of total sulfur dioxide.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
From the histogram, it appears that a lot of the wines gave less precise entries for density. Our boxplot shows a normal distribution with some outliers. Let’s adjust the bin width of the first plot to get a better look.
This graph is normal and nearly all wines fall within a small range of 0.99 to 1.005.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
Like nearly all of the variables, pH is normal and has some outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The majority of wines contain 0.5 to 0.7 levels of sulphates, and very few have more than 1. Yet some of the wines contain over 1.5 sulphates, but they are only a fraction of the population.
The data set contains 1599 red wines each of which has 12 features: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality.
Other observations: Most variables are roughly normal or skewed right; citric acid is the exception. Nearly all of the wines got a middle-of-the-road quality score of 5 or 6. No wines scored above 8 or below 3. Many of the variables have extreme outliers.
The main feature is quality. My goal is to determine which of the other features of the wines most determine its quality score.
I believe that alcohol, the three types of acids, and residual sugar will have the highest influence on the quality of the wines as they most contribute to its flavor.
A wine’s taste is mainly determined by four factors, and acidity is one of them. I created a new variable, total.acidity, by combining fixed.acidity and volatile.acidity. This variable will be used to represent a wine’s acid profile.
I also created the variable single.quality.buckets which is an ordered factor variable of quality.
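One way such an ordered factor could be built is with `cut()`. The breaks `2:8` are inferred from the interval labels in the quality table printed earlier, so the exact call is an assumption. The sketch uses the first 10 quality scores from the `str()` output.

```r
# First 10 quality scores from the str() output (not the full data set).
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Each interval (k-1, k] contains exactly one integer score k.
single.quality.buckets <- cut(quality, breaks = 2:8, ordered_result = TRUE)

print(levels(single.quality.buckets))
print(table(single.quality.buckets))
```

Because the factor is ordered, comparisons such as `single.quality.buckets > "(4,5]"` are meaningful, and plots keep the levels in score order.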
Citric Acid was the only variable with an odd distribution. The rest of the variables had either normal or skewed right distributions. The data came in tidy form and had no NA values, so no adjustments were necessary.
With so many variables, let’s look at a correlation matrix to see how variables relate to each other.
Let’s begin with the factors with the highest correlation to quality. The variables that correlate most with quality are alcohol, volatile acidity, sulphates, and citric acid. Quality has a moderate correlation with alcohol and a weak correlation with volatile acidity, sulphates, and citric acid. Let’s see if the graphs reflect this.
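A sketch of how the ranking by correlation with quality could be produced with base R's `cor()`. It uses only the first 10 rows from the `str()` output, so the coefficients it prints will not match those of the full data set; the object names `cors` and `ranked` are illustrative.

```r
# First 10 rows of five columns, from the str() output.
wines <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation of every column with quality.
cors <- cor(wines)[, "quality"]
cors <- cors[names(cors) != "quality"]   # drop quality's correlation with itself

# Rank by absolute value so strong negative correlations are not overlooked.
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 3))
```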
There is far too much over plotting in this graph to make any conclusions.
Adding some noise, we begin to see that quality increases as alcohol does, but let’s see if there is anything we can do to make the relationship more clear.
From this graph, we can see a clear correlation. Quality scores between 3 and 5 have similar levels of alcohol averages, but at quality scores higher than 5 we see that alcohol levels are higher as quality increases.
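The per-level averages behind a plot like this can be computed with `aggregate()`. The sketch below uses the first 10 rows from the `str()` output, which only cover quality scores 5, 6, and 7, so it illustrates the calculation rather than reproducing the graph.

```r
# First 10 rows of alcohol and quality, from the str() output.
wines <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Mean alcohol at each quality score.
mean_by_quality <- aggregate(alcohol ~ quality, data = wines, FUN = mean)
print(mean_by_quality)
```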
Let’s now investigate the variable that has the second highest correlation with quality, volatile acidity.
Skipping straight to our box plot, we see an immediate pattern: a negative correlation between quality and volatile acidity. According to winefolly.com, acidity is what gives wine its tart and sour taste. It makes sense that too much of this flavor would lower a wine’s score.
In the univariate section, we noticed some outliers for volatile acid. Let’s take a quick look into the wines that had volatile acidity levels higher than 1.
Surprisingly, several of the high volatile acidity wines still received average scores. Unsurprisingly, 3 out of the 10 wines that received the lowest score of 3 had high levels.
Let’s now investigate the variable with the next highest correlation to quality, sulphates.
Median sulphates increase at each incremental level of quality. The mean does as well, except between quality levels 3 and 4, where the drop is small and there aren’t very many wines. The general trend shows quality increasing as sulphates increase.
Earlier in the univariate section, we saw some wines with very high sulphate levels above 1.5. The graph above shows that quality increases as sulphates do, but none of the high sulphate wines had enough data points to be represented in the graph above. Let’s see what the graph of those outliers looks like.
It appears that the sulphate outliers buck the trend of increased quality as none of the high sulphate wines scored above 6 and the wine with the highest sulphate level scored an awful 4. There aren’t enough observations with these high numbers to make any conclusions.
Let’s take a look at the variable with the fourth highest correlation to quality, citric acid.
There is a lot more variation here than in the previous graphs, but focusing on the averages, we can see citric acid levels trending higher as quality increases, and thus there is a correlation. Wines with higher levels of citric acid received higher quality scores.
We were able to provide visualizations to support the correlation between quality and the four variables most strongly correlated with it.
While I am far from a wine connoisseur, I tend to judge wines primarily by taste. According to winesandvines.com, balance between sweetness, alcohol, acid, and tannin is important to wine quality. Let residual sugar represent sweetness, and total acidity represent acid. We don’t have a good variable to represent tannin, but let’s see how the other three variables relate to each other.
There doesn’t seem to be any correlation between alcohol and total acidity, but I am curious to know if any combination of the two variables helps determine quality.
Again, there doesn’t seem to be much of a relationship between total acidity and sugar, but let’s check in the next section whether specific combinations of the two help determine quality.
The graph looks nearly identical to the previous one. I had heard that sugar is added to many wines that have a high alcohol content in order to mask the burning taste. Since residual sugars are the sugars left over after fermentation, it makes sense to see that the wines with the highest residual sugar levels have low levels of alcohol. Maybe residual sugar is a bad representation of sweetness levels in wine as it doesn’t seem to include any added sugar.
There doesn’t seem to be much of a pattern between the three variables when graphed against each other. However, I would like to explore how the three variables compare when also graphed with quality, to see if any combinations of the three lead to better quality scores.
Our feature of interest is quality. I checked its relationship to the four variables that had a correlation coefficient of at least 0.2 in absolute value: alcohol, volatile acidity, sulphates, and citric acid. When graphing the other features against quality, there was a lot of variance and it was hard to produce a linear approximation. For alcohol, volatile acidity, and sulphates, graphing the mean at each quality level produced a relatively linear relationship. For citric acid, a density graph along each level of quality showed the correlation.
Looking at the relationships between the three main flavor attributes, sweetness (residual sugar), alcohol, and acidity (total acidity), I was unable to find any relationships when graphed against each other. I’m hoping to find something interesting when quality is also included in the next section.
The strongest relationship I found was the positive correlation between alcohol and quality. I am interested to see how this relationship changes in the next section when alcohol is compared to quality at different levels of acidity and sugar.
In the previous section we checked the relationships between alcohol, total acidity, and residual sugar, but were unable to find any correlations between them. Let’s begin this section by revisiting those relationships but throwing quality into the mix.
As earlier, let’s begin with the alcohol vs total acidity.
Adding quality to the visualization shows us that wines with high alcohol and total acidity scored better on average than wines with lower levels. However, wines tend to be balanced in that those with the same quality score have either high comparative alcohol levels or high comparative total acidity levels, but not both.
There doesn’t seem to be any pattern to how acidity and sweetness together affect quality. We can see both high and low quality wines at nearly all combination levels of the two variables.
In this graph we can see a clear pattern with quality. Unfortunately, quality only seems to be affected by alcohol. As alcohol level increases so does quality, but as we increase along the x-axis quality stays the same as residual sugar increases. It’s a shame we aren’t provided with total sugar data. I believe that if we had a variable that better represented the sweet flavor profile of wines, we would have gotten more interesting results from the previous two visualizations.
##
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = wines)
## m2: lm(formula = I(quality) ~ I(alcohol) + sulphates, data = wines)
## m3: lm(formula = I(quality) ~ I(alcohol) + sulphates + volatile.acidity,
## data = wines)
## m4: lm(formula = I(quality) ~ I(alcohol) + sulphates + volatile.acidity +
## citric.acid, data = wines)
##
## ============================================================================
##                          m1            m2            m3            m4
## ----------------------------------------------------------------------------
## (Intercept)              1.875***      1.375***      2.611***      2.646***
##                         (0.175)       (0.177)       (0.196)       (0.201)
## I(alcohol)               0.361***      0.346***      0.309***      0.309***
##                         (0.017)       (0.016)       (0.016)       (0.016)
## sulphates                              0.994***      0.679***      0.696***
##                                       (0.102)       (0.101)       (0.103)
## volatile.acidity                                    -1.221***     -1.265***
##                                                     (0.097)       (0.113)
## citric.acid                                                       -0.079
##                                                                   (0.104)
## ----------------------------------------------------------------------------
## R-squared                0.227         0.270         0.336         0.336
## adj. R-squared           0.226         0.269         0.335         0.334
## sigma                    0.710         0.690         0.659         0.659
## F                      468.267       294.988       268.912       201.777
## p                        0.000         0.000         0.000         0.000
## Log-likelihood       -1721.057     -1675.142     -1599.384     -1599.093
## Deviance               805.870       760.894       692.105       691.852
## AIC                   3448.114      3358.284      3208.768      3210.186
## BIC                   3464.245      3379.793      3235.654      3242.448
## N                     1599          1599          1599          1599
## ============================================================================
The linear model can only account for 34% of variance, and citric acid didn’t improve that number at all.
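The nested comparison in the table can be sketched with base R alone. The version below drops the `I()` wrappers from the original formulas, uses `update()` to add one predictor at a time, and fits on just the first 10 rows from the `str()` output, so its numbers will not match the table above. It only shows how R-squared, adjusted R-squared, and AIC can be lined up across models; the object names `models` and `fit_stats` are illustrative.

```r
# First 10 rows of the model variables, from the str() output.
wines <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Nested linear models, adding one predictor at a time.
m1 <- lm(quality ~ alcohol, data = wines)
m2 <- update(m1, . ~ . + sulphates)
m3 <- update(m2, . ~ . + volatile.acidity)
m4 <- update(m3, . ~ . + citric.acid)
models <- list(m1 = m1, m2 = m2, m3 = m3, m4 = m4)

# Collect the fit statistics side by side.
fit_stats <- data.frame(
  r.squared     = sapply(models, function(m) summary(m)$r.squared),
  adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC           = sapply(models, AIC)
)
print(round(fit_stats, 3))
```

Plain R-squared can never fall when a predictor is added, which is why adjusted R-squared and AIC are the more useful columns for deciding whether a term earns its place. In the table above, AIC rises slightly from m3 to m4, which agrees with the observation that citric acid adds nothing.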
There was an interesting interaction between alcohol and total acidity. The feature of interest, quality, would increase as alcohol and acidity increased, but only one of the variables would increase in comparison to the other. It appears as though strong flavors are preferred, but only one strong flavor and not both.
I found it surprising that residual sugar had little to do with quality scores when compared in conjunction with another flavor variable. Since sweetness is a major component of flavor, which is a major component of wine quality, and since sweetness is added to help balance out bitter flavors, acidic flavors, and the burn from alcohol, I was expecting it to have a huge effect on quality.
One possible explanation for this is that residual sugar only represents the sugars left over after fermentation and not necessarily added sugars so the variable we use to approximate the sweetness of a wine is not accurate.
Yes, I created a model using the four variables from the bivariate section.
The model is limited, as the four variables accounted for only 34% of the variance in wine quality. Citric acid did not improve the R^2 value at all and could be left out of the model.
Using these two graphs together gives you a lot of immediate information for your first exposure to each input variable. Right away we can see a steep normal distribution with half of the wines having a pH between 3.2 and 3.4. We can also see the outliers including the extremes having a pH of over 4.
At first glance, it was hard to see the relationship between quality and the input variables. This visualization shows lower levels of volatile acidity as quality increases. Between quality level 7 and 8 mean increases very slightly, but despite this the trend is clear. Volatile acidity has a negative correlation with quality.
This graph demonstrated that some variables, when together, affected quality differently than when they were on their own. Here, quality was at its highest when one variable was high but the other was relatively low. Wines that had high levels of both, like the one with nearly 15% alcohol and over 15 g/dm^3 of total acidity, scored poorly.
The wine dataset contained 1599 different red wines. Each wine had 11 input attributes and one output attribute. My overall objective was to find how the input attributes affected quality scores. I began by looking at each of the variables to understand them. I then began comparing some of the variables to quality and saw how they correlated. Some clear trends emerged, especially between alcohol and quality. Comparing some of the input variables against each other gave some surprising results. Residual sugar had little relationship with any of the other variables, even when combined against quality. While I didn’t find very strong correlations between quality and the input variables, I still made an attempt to create a model. Unfortunately, the model was only able to account for 34% of variance.
Several limitations to the model exist. To begin with, its R^2 value is low. Also, quality values were subjective, which introduces bias. Finally, quality values had a very narrow distribution. None of the wines received scores of 1, 2, 9, or 10, and 82% of the wines had an average score of 5 or 6. This lack of differentiation made finding correlations difficult.
For further investigation, I would look in more depth at whether the wines that received high quality scores (7 or 8) had any glaring differences from those that scored in the middle (5 or 6) or low (3 or 4) tier. A larger dataset would be useful, especially one that included some exceptional (9 or 10) and terrible (1 or 2) wines. More variables would be better as well. I would be most interested in knowing the total sugar and tannin levels to get the full scope of wine flavor.